Welcome to the second part of the tutorial, where we are going to have a look at another popular social media platform: YouTube.
library(knitr)
#library(magick)
library(png)
library(tuber)
library(tidyverse)
## ── Attaching packages ──────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.0.0 ✔ purrr 0.2.5
## ✔ tibble 1.4.2 ✔ dplyr 0.7.6
## ✔ tidyr 0.8.1 ✔ stringr 1.3.1
## ✔ readr 1.1.1 ✔ forcats 0.3.0
## Warning: package 'dplyr' was built under R version 3.5.1
## ── Conflicts ─────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(tidytext)
library(grid)
#library(emo)
library(icon)
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
## The following object is masked from 'package:purrr':
##
## transpose
library(psych)
##
## Attaching package: 'psych'
## The following object is masked from 'package:icon':
##
## fa
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
options(stringsAsFactors = FALSE)
Once you set up your YouTube access via the Google Developer Console, you can connect to the API with the tuber package and download data.
A reminder: to connect you use yt_oauth().
app_id = "667235664106-a5e6ng7pna0ptv7qoqnrsn0ldm1i68a8.apps.googleusercontent.com"
app_secret = "8UV8kvt8cZ75EXg-ubOPBjDJ"
#connect
yt_oauth(app_id=app_id, app_secret=app_secret, token='', cache=FALSE)
By default the function looks for .httr-oauth in the working directory, in case you connected before. If it doesn’t find one, it uses the application ID and secret you pass. If you do not want the function to use the cache, set cache=FALSE.
The function launches a browser to allow you to authorize the application
Once you are connected, we can start searching!
results <- yt_search("World Cup 2018")
kable(results[1:4,1:5])
| video_id | publishedAt | channelId | title | description |
|---|---|---|---|---|
| qlZaiBuOaz4 | 2017-09-09T19:17:50.000Z | UC07CHp5ikd-AyafR4J2LWkA | No T20 World Cup In 2018, Australia Will host Next WT20 in 2020 | It is now officially confirmed by ICC that the 2018 season of ICC WT20 Will not gonna happen,Australia now awarded as the next destination of icc t20 world cup … |
| 9BekP8KQBtY | 2018-07-10T19:28:29.000Z | UCH0lge-0lFszfPRqFkaoXpw | Francia Vs Belgio - World CUP 2018 Semifinale “Calci di Rigore” | PES 2018 Patch [Giù] | PES 2018 MiniFaces Pack World Cup Teams http://freedownloadtn.altervista.org/pes%202018%20minifaces%20pack%20world%20cup%20teams.php … |
| s-AYocn9ulU | 2018-07-10T14:00:08.000Z | UCH0lge-0lFszfPRqFkaoXpw | Juventus (CR7⭐) Vs Inter Classic (Adriano) impressionante bomba di Adriano | PES 2018 Patch [Giù] | PES 2018 MiniFaces Pack World Cup Teams http://freedownloadtn.altervista.org/pes%202018%20minifaces%20pack%20world%20cup%20teams.php … |
| aWxbphbmDts | 2018-01-22T15:48:13.000Z | UCqMp3kP5EP0YXANgY-I43JA | T20 World Cup 2018 Full Schedule And Time Table ICC World 20 20 Cup | ICC T20 World Cup 2018 Full Schedule And Time Table in urdu hindi ICC t20 World Cup 2018 ICC t20 World Cup full time table Hind Pakistan Vs India T20 … |
The yt_search function returns a data.frame with 16 elements, including:

- video_id: id of the video, which can be used later to get its stats
- publishedAt: date of publication
- channelId: id of the channel, which can be used later to access the channel
- title: title of the video
- description: a short description of the video
- channelTitle: a more “user-friendly” name of the channel

The yt_search function takes parameters that let you specify your search. The most popular ones are:

- term: the search term to be used; the search runs across different types, including video (the default), channel and playlist
- max_results: maximum number of items to be returned, with the default set to 50 and the maximum allowed 500
- channel_id: searches videos on a channel; requires a channel id
- type: specifies the type of media and can take one of three values: video (default), channel, playlist
- published_after and published_before: specify the timeframe for the search

Now let’s have a look at the comments of a video. We can pick any video from our search and look at its comments, using video_id to get them.
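Before we do, here is a quick sketch of how the search parameters above can be combined; the date values are purely illustrative, not from the original search.

```r
# Hedged sketch: restrict the search to videos published during the
# 2018 World Cup (illustrative dates, RFC 3339 format)
wc_videos <- yt_search(term = "World Cup 2018",
                       type = "video",
                       published_after = "2018-06-14T00:00:00Z",
                       published_before = "2018-07-15T23:59:59Z",
                       max_results = 50)
```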
results <- get_comment_threads(c(video_id="qlZaiBuOaz4"))
kable(results[1:10, c("authorDisplayName", "textDisplay")])
| authorDisplayName | textDisplay |
|---|---|
| Suraj Yadav | Sale jhooth bolta hai faltu channal ko subscribe nhi kiya jata |
| Meena Srivastava | Nice video bro |
| bdayal singh | My favorite tournament is T20 world cup |
| DHRUV SAWANT | 2020 me hoga |
| Nikhil Yadav | GU hai TU guuu |
| Vandana2 Rajpurohit | yash bhai app tho ipl ke hi video banate ho aap world cricket ke upar bhi video banao |
| Arjun Bhandari | Ipl k bareme batao jiii rcb k bareme kya change hoga |
| Vikas Mathur | Bhosdee ke tuje bada pata h |
| Munmun Samanta | chip Kar faltu |
| Mohammed Rafique | Asia cup kab hoga bhai |
You specify your video id in the filter argument. You can also use this argument to download comments for a specific channel. So, let’s have a look at your favourite YouTube channel. Which one is your favourite? Mine is Bloomberg
The channel id for Bloomberg is UCUMZ7gohGI9HcU9VNsr2FJQ. We are going to use a more generic function, list_channel_resources, which returns a list of the requested channel resources.
a <- list_channel_resources(filter = c(channel_id = "UCUMZ7gohGI9HcU9VNsr2FJQ"), part="contentDetails")
# Uploaded playlists:
playlist_id <- a$items[[1]]$contentDetails$relatedPlaylists$uploads
# Get videos on the playlist
vids <- get_playlist_items(filter= c(playlist_id=playlist_id))
# Video ids
vid_ids <- as.vector(vids$contentDetails.videoId)
# Function to scrape stats for all vids
get_all_stats <- function(id) {
get_stats(id)
}
# Get stats and convert results to data frame
res <- lapply(vid_ids, get_all_stats)
res_df <- rbindlist(lapply(res, data.frame), fill=TRUE)
head(res_df)
## id viewCount likeCount dislikeCount favoriteCount commentCount
## 1: j8x7TIm4rK4 5521 320 64 0 181
## 2: 5afmFgfdFzA 82051 886 71 0 173
## 3: 1zDZmIDbDO0 38765 1344 76 0 167
## 4: orR8FSlKGo0 81144 1131 183 0 322
## 5: -pU3lfz2xN8 44859 1180 58 0 155
## 6: l9RWTMNnvi4 51940 1933 23 0 111
But… I guess it’s getting too busy for the day, so let’s take a break and have a look at Victoria’s Secret
It’s Brisbane, Australia after all!
To locate the channel we need to use its channel id, not its name.
Obtaining data from a YouTube channel is usually done through channel_id, which is not the same as the YouTube name you see in the YouTube link.
There are several ways to obtain the channel_id for a YouTube name:

- Query the API at https://www.googleapis.com/youtube/v3/channels?key={YOUR_API_KEY}&forUsername={USER_NAME}&part=id, where YOUR_API_KEY is the key you create in the google developer account and USER_NAME is the YouTube channel (username)
- View the channel page source and search for externalId or data-channel-external-id. The value there will be the channel id.

Now that we have the channel id, let’s get a list of its videos and have a look at its stats.
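As an aside, the API lookup from the first method can also be done directly from R with the httr package. A hedged sketch, where YOUR_API_KEY and USER_NAME are placeholders you need to fill in yourself:

```r
# Hedged sketch: look up a channel id by username via the Data API.
# "YOUR_API_KEY" and "USER_NAME" are placeholders, not real values.
library(httr)
resp <- GET("https://www.googleapis.com/youtube/v3/channels",
            query = list(key = "YOUR_API_KEY",
                         forUsername = "USER_NAME",
                         part = "id"))
content(resp)$items[[1]]$id  # the channel_id
```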
channel_id<-"UChWXY0e-HUhoXZZ_2GlvojQ"
videosVS = yt_search(term="", type="video", channel_id = channel_id)
kable(videosVS[1:4,1:5])
| video_id | publishedAt | channelId | title | description |
|---|---|---|---|---|
| HLo_3GfHjps | 2012-11-30T17:10:00.000Z | UChWXY0e-HUhoXZZ_2GlvojQ | Miranda Kerr and Bruno Mars Backstage at the 2012 Victoria’s Secret Fashion Show | Get ready for dimples, winks and giggles when Miranda Kerr quizzes Bruno Mars backstage at the 2012 Victoria’s Secret Fashion Show. Catch them both on the … |
| 76gf3YeGf1s | 2010-11-30T17:49:42.000Z | UChWXY0e-HUhoXZZ_2GlvojQ | Before I Was a Supermodel: Behati Prinsloo | On the eve of the 2010 Fashion Show, Supermodel Behati Prinsloo shares some memories about growing up and becoming a model. |
| wMYYQkwLDqM | 2013-10-17T15:27:54.000Z | UChWXY0e-HUhoXZZ_2GlvojQ | Candice Swanepoel Meets the Royal Fantasy Bra | Go on set as Victoria’s Secret Angel Candice Swanepoel gets to see and try on the Royal Fantasy Bra for the very first time. Look for Candice and the $10 million … |
| anhEcKWRzyE | 2017-10-16T17:06:39.000Z | UChWXY0e-HUhoXZZ_2GlvojQ | Road to the Runway: Episode 1 – Castings | The 2017 Victoria’s Secret Fashion Show magic kicks off with castings. Go behind the scenes as this year’s models make their way to wearing those … |
#get channel stats
statsVS<-get_channel_stats(channel_id=channel_id)
## Channel Title: Victoria's Secret
## No. of Views: 266052763
## No. of Subscribers: 1505487
## No. of Videos: 743
statsVSSelected <- as.vector(statsVS$statistics)
results<-do.call(rbind, statsVSSelected)
head(results)
## [,1]
## viewCount "266052763"
## commentCount "0"
## subscriberCount "1505487"
## hiddenSubscriberCount "FALSE"
## videoCount "743"
The get_channel_stats function is quite straightforward. It takes channel_id as an argument and returns a nested list. We can select the items we need from the list and convert them to a data.frame.
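A tidyverse-flavoured alternative for flattening the nested list (a sketch, assuming all elements of statistics are scalars, as they are here):

```r
# Hedged sketch: flatten the nested statistics list into a one-row tibble
library(tibble)
stats_row <- as_tibble(statsVS$statistics)
stats_row
```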
Now that we have a list of videos from the VS channel, let’s download stats for each video. As an example, let’s do the first 10.
videosVS_sample<-videosVS[1:10,]
videoStatsVS = lapply(as.character(videosVS_sample$video_id), function(x){
get_stats(video_id = x)
})
videoStatsVS_df = do.call(rbind.data.frame, videoStatsVS)
head(videoStatsVS_df)
## id viewCount likeCount dislikeCount favoriteCount commentCount
## 2 HLo_3GfHjps 2265269 16365 277 0 1826
## 21 76gf3YeGf1s 886327 5478 72 0 299
## 3 wMYYQkwLDqM 2491692 17840 299 0 1306
## 4 anhEcKWRzyE 2173214 37751 572 0 1672
## 5 DkcctlhyWEs 954716 9375 202 0 402
## 6 LwelJNogl-M 883488 7721 88 0 200
The function uses video ids, but does not return video titles or dates, which we can add ourselves before doing some clean-up.
videoStatsVS_df$title = videosVS_sample$title
videoStatsVS_df$date = videosVS_sample$date
library(tidyverse)
videoStatsVS_df = as_tibble(videoStatsVS_df) %>%
mutate(viewCount = as.numeric(as.character(viewCount)), #originally as factor
likeCount = as.numeric(as.character(likeCount)),
dislikeCount = as.numeric(as.character(dislikeCount)),
commentCount = as.numeric(as.character(commentCount)))
head(videoStatsVS_df)
## # A tibble: 6 x 7
## id viewCount likeCount dislikeCount favoriteCount commentCount title
## <chr> <dbl> <dbl> <dbl> <chr> <dbl> <chr>
## 1 HLo_3… 2265269 16365 277 0 1826 Mira…
## 2 76gf3… 886327 5478 72 0 299 Befo…
## 3 wMYYQ… 2491692 17840 299 0 1306 Cand…
## 4 anhEc… 2173214 37751 572 0 1672 Road…
## 5 Dkcct… 954716 9375 202 0 402 Rome…
## 6 LwelJ… 883488 7721 88 0 200 Vict…
I ran the function with the full list of video ids for the channel, and you can use the resulting file, videoStatsVS_df.csv, under the YouTube tutorial-data folder. Let’s load it.
videoStatsVS_df <- as.data.table(read.csv("YouTube tutorial-data/videoStatsVS_df.csv", stringsAsFactors=FALSE))
head(videoStatsVS_df)
## X id viewCount likeCount dislikeCount favoriteCount
## 1: 1 HLo_3GfHjps 2259491 16317 277 0
## 2: 2 76gf3YeGf1s 884766 5463 72 0
## 3: 3 wMYYQkwLDqM 2490140 17827 299 0
## 4: 4 DkcctlhyWEs 953821 9353 202 0
## 5: 5 anhEcKWRzyE 2165304 37656 569 0
## 6: 6 LwelJNogl-M 883017 7711 88 0
## commentCount
## 1: 1826
## 2: 299
## 3: 1305
## 4: 402
## 5: 1673
## 6: 200
## title
## 1: Miranda Kerr and Bruno Mars Backstage at the 2012 Victoria's Secret Fashion Show
## 2: Before I Was a Supermodel: Behati Prinsloo
## 3: Candice Swanepoel Meets the Royal Fantasy Bra
## 4: Romee Strijd on Becoming a Victoria’s Secret Angel
## 5: Road to the Runway: Episode 1 – Castings
## 6: Victoria’s Secret Angel Outtakes 2014
Let’s see which videos were most popular.
The videos with the most views:
videoStatsVS_df %>% arrange(desc(viewCount)) %>%
top_n(n = 5, wt = viewCount) %>%
select(title, viewCount, likeCount, favoriteCount, commentCount, id)
## title viewCount likeCount favoriteCount
## 1 Why Alessandra Ambrosio loves her body 773575 3250 0
## 2 You've Got Male 422072 881 0
## 3 Wild & Beautiful: Behind the Scenes 338447 1115 0
## 4 Why Lindsay Ellingson loves her body 178581 320 0
## 5 Why Emanuela DePaula loves her body 118804 339 0
## commentCount id
## 1 23 EMOxRWd-Q1k
## 2 19 drxva8UM-zM
## 3 17 nmTT42Q0bsI
## 4 17 B3QSmqcTojQ
## 5 11 jANpsZGKPXA
Let’s have a look at it!
The videos with the most likes:
videoStatsVS_df %>% arrange(desc(likeCount)) %>%
top_n(n = 5, wt = likeCount) %>%
select(title, likeCount, viewCount, favoriteCount, commentCount, id)
## title likeCount viewCount favoriteCount
## 1 Why Alessandra Ambrosio loves her body 3250 773575 0
## 2 Wild & Beautiful: Behind the Scenes 1115 338447 0
## 3 You've Got Male 881 422072 0
## 4 Why Emanuela DePaula loves her body 339 118804 0
## 5 Why Lindsay Ellingson loves her body 320 178581 0
## commentCount id
## 1 23 EMOxRWd-Q1k
## 2 17 nmTT42Q0bsI
## 3 19 drxva8UM-zM
## 4 11 jANpsZGKPXA
## 5 17 B3QSmqcTojQ
The videos with the most comments:
videoStatsVS_df %>% arrange(desc(commentCount)) %>%
top_n(n = 5, wt = commentCount) %>%
select(title, commentCount, viewCount, likeCount, favoriteCount, id)
## title commentCount viewCount likeCount
## 1 Why Alessandra Ambrosio loves her body 23 773575 3250
## 2 You've Got Male 19 422072 881
## 3 Wild & Beautiful: Behind the Scenes 17 338447 1115
## 4 Why Lindsay Ellingson loves her body 17 178581 320
## 5 Why Emanuela DePaula loves her body 11 118804 339
## favoriteCount id
## 1 0 EMOxRWd-Q1k
## 2 0 drxva8UM-zM
## 3 0 nmTT42Q0bsI
## 4 0 B3QSmqcTojQ
## 5 0 jANpsZGKPXA
Let’s have a look at titles now and see what we can find there.
Continuing with the “girl power” theme, let’s compare VS to another powerhouse, US Vogue.
Following the procedure described earlier, I downloaded video stats for both the VS and US Vogue channels and merged them into one file, videostats_All.csv
Further analysis will involve manipulating text, so we will need the tidyverse and tidytext packages.
library(tidyverse)
library(tidytext)
You can either download the channel stats yourself (see above) or use videostats_All.csv.
The channel ids are:

- American Vogue = UCRXiA3h1no_PFkb1JCP0yMA
- VICTORIASSECRET = UChWXY0e-HUhoXZZ_2GlvojQ

To load the existing file:
videostats_All <- as.data.table(read.csv("YouTube tutorial-data/videostats_All.csv", stringsAsFactors=FALSE))
head(videostats_All)
## date
## 1: 2016-01-05
## 2: 2016-01-18
## 3: 2016-01-19
## 4: 2016-01-22
## 5: 2016-02-02
## 6: 2016-02-11
## title
## 1: Inside the Brooklyn Home of Artist Mickalene Thomas
## 2: Watch Irene Kim’s Japanese Hot Springs Adventure
## 3: 73 Questions With Derek Zoolander
## 4: Watch Pop Star in the Making Zella Day Perform a Heartbreaking Ballad
## 5: Cleaning House With Organizing Guru Marie Kondo
## 6: Here’s What It’s Like to Be Lucky Blue Smith During Fashion Week
## video_id viewCount likeCount dislikeCount commentCount source
## 1: SMX60fh5u7o 29346 1304 7 27 Vogue
## 2: r-shJpCflvQ 59103 1206 29 38 Vogue
## 3: H4q0K561WXs 2395735 34455 1817 1754 Vogue
## 4: wg19p1cDSRE 58920 2104 9 101 Vogue
## 5: z3OXvQZe7g8 392529 4121 870 171 Vogue
## 6: LIKTr9vXRQI 141677 2853 52 131 Vogue
Let’s have a brief look at the data. We are going to use the stargazer package, which is fantastic for generating “academic”-looking results, and the describeBy function from the psych package, which generates statistics by a grouping variable. We will group the variables by channel (the source column).
library(stargazer)
stargazer(videostats_All[,.(viewCount, likeCount, commentCount, dislikeCount)], median=TRUE, digit= 1, type = "text")
##
## ============================================================================================
## Statistic N Mean St. Dev. Min Pctl(25) Median Pctl(75) Max
## --------------------------------------------------------------------------------------------
## viewCount 456 1,199,256.000 2,452,764.000 6,477 115,134 367,336.5 1,082,360.0 20,665,383
## likeCount 456 25,219.360 66,559.910 145 1,992 5,040.5 18,092 747,921
## commentCount 455 1,150.145 3,401.707 4.000 85.500 202.000 707.500 31,101.000
## dislikeCount 456 727.618 2,297.890 1 33 88.5 425 29,639
## --------------------------------------------------------------------------------------------
##
library(psych)
results<-describeBy(videostats_All[, .(viewCount, likeCount, commentCount, dislikeCount)],
group=videostats_All$source, digits=1, mat=TRUE)
results[,c(1:7, 10:11)]
## item group1 vars n mean sd median min
## viewCount1 1 Vogue 1 307 1512729.9 2874266.3 478164 6477
## viewCount2 2 VS 1 149 553373.0 889076.4 208598 31747
## likeCount1 3 Vogue 2 307 35448.6 79096.3 10367 145
## likeCount2 4 VS 2 149 4143.1 4528.7 2468 722
## commentCount1 5 Vogue 3 306 1611.7 4067.6 359 4
## commentCount2 6 VS 3 149 202.3 235.9 111 17
## dislikeCount1 7 Vogue 4 307 1042.1 2745.7 211 1
## dislikeCount2 8 VS 4 149 79.7 136.8 43 5
## max
## viewCount1 20665383
## viewCount2 5348750
## likeCount1 747921
## likeCount2 37661
## commentCount1 31101
## commentCount2 1673
## dislikeCount1 29639
## dislikeCount2 1189
Just a reminder:

- mean: the average
- st. dev.: a measure of variation in the data compared to the average
- min and max: the extreme values
It is likely that the number of views relates to the number of likes: the more people view the video, the more they “like” it. We can test this with a correlation, using the corr.test function from the same psych package.
results<-corr.test(videostats_All[, .(viewCount, likeCount, commentCount, dislikeCount)], use = "complete",method="pearson",adjust="holm",
alpha=.05,ci=FALSE)
results$r
## viewCount likeCount commentCount dislikeCount
## viewCount 1.0000000 0.8130140 0.8078345 0.8286073
## likeCount 0.8130140 1.0000000 0.9469651 0.6756837
## commentCount 0.8078345 0.9469651 1.0000000 0.7457451
## dislikeCount 0.8286073 0.6756837 0.7457451 1.0000000
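corr.test also returns the matrix of p-values, if you want to check significance. A quick sketch:

```r
# Hedged sketch: p-values of the correlations. Per corr.test's
# convention, entries above the diagonal are adjusted for multiple
# tests (Holm here), entries below are the raw values.
round(results$p, 3)
```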
Or we can plot it on a graph with the help of ggplot2 and gridExtra.
library(ggplot2)
library(gridExtra)
p1=ggplot(data = videostats_All[-1, ]) + geom_point(aes(x = viewCount, y = likeCount))
p2=ggplot(data = videostats_All[-1, ]) + geom_point(aes(x = viewCount, y = dislikeCount))
p3=ggplot(data = videostats_All[-1, ]) + geom_point(aes(x = viewCount, y = commentCount))
grid.arrange(p1, p2, p3, ncol = 2)
You just have to LOVE Adriana Lima! But let’s move on….
As you see, the title column describes the video. It is logical to assume that the title is the first thing to attract the viewer’s attention. Let’s have a closer look and see if we can identify specific words.
Let’s tokenize the titles, remove stop words, and calculate word frequencies. Frequency is calculated as the number of times a particular word is used in titles relative to the total number of videos on the channel.
title_words_All_Source<-videostats_All %>%
as_tibble() %>%
unnest_tokens(word, title) %>%
anti_join(stop_words) %>%
count(source, word, sort = TRUE) %>%
left_join(videostats_All %>%
group_by(source) %>%
summarise(total = n())) %>%
mutate(freq = n/total)
## Joining, by = "word"
## Joining, by = "source"
I saved the output into the title_words_All_Source.csv file, so you can load it and continue working with a tokenized version of the text.
title_words_All_Source <- as.data.table(read.csv("YouTube tutorial-data/title_words_All_Source.csv", stringsAsFactors=FALSE))
head(title_words_All_Source)
## source word n total freq
## 1: Vogue vogue 248 315 0.7873016
## 2: VS victoria’s 100 148 0.6756757
## 3: VS secret 98 148 0.6621622
## 4: Vogue met 84 315 0.2666667
## 5: Vogue gala 82 315 0.2603175
## 6: Vogue beauty 61 315 0.1936508
We can compare frequencies of the word usage between VS and Vogue:
frequencyVS_Vogue <- title_words_All_Source %>%
select(source, word, freq) %>%
spread(source, freq) %>%
arrange(VS, Vogue)
head(frequencyVS_Vogue, 10)
## word Vogue VS
## 1 60 0.003174603 0.006756757
## 2 backstage 0.003174603 0.006756757
## 3 butt 0.003174603 0.006756757
## 4 david 0.003174603 0.006756757
## 5 fittings 0.003174603 0.006756757
## 6 it’s 0.003174603 0.006756757
## 7 lima’s 0.003174603 0.006756757
## 8 lip 0.003174603 0.006756757
## 9 music 0.003174603 0.006756757
## 10 official 0.003174603 0.006756757
To make more sense of this, let’s show word usage by building a wordcloud. This will allow us to see the most prominent words immediately.
words_VS <-title_words_All_Source%>%
filter(source=="VS")%>%
select(word, n)
words_Vogue <-title_words_All_Source%>%
filter(source=="Vogue")%>%
select(word, n)
library(wordcloud)
wordcloud(words_VS$word, words_VS$n)
wordcloud(words_Vogue$word, words_Vogue$n)
library(wordcloud2)
wordcloud2(data=words_VS, size = 1,
color = "random-light", shape = 'star')
Moving on further into the text analysis, let’s have a look at the sentiment.
Sentiment analysis is a tool that assesses the tone and the voice of a text. It uses pre-made dictionaries with words classified into categories, and different dictionaries use different categories. For example, the nrc lexicon categorizes words in a binary fashion (“yes”/“no”) into categories of positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust. We can get the list of words in each dictionary, and there are further options to adjust the list:
get_sentiments("afinn") #assigns words with a score that runs between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment.
## # A tibble: 2,476 x 2
## word score
## <chr> <int>
## 1 abandon -2
## 2 abandoned -2
## 3 abandons -2
## 4 abducted -2
## 5 abduction -2
## 6 abductions -2
## 7 abhor -3
## 8 abhorred -3
## 9 abhorrent -3
## 10 abhors -3
## # ... with 2,466 more rows
get_sentiments("bing") #categorizes words in a binary fashion into positive and negative categories.
## # A tibble: 6,788 x 2
## word sentiment
## <chr> <chr>
## 1 2-faced negative
## 2 2-faces negative
## 3 a+ positive
## 4 abnormal negative
## 5 abolish negative
## 6 abominable negative
## 7 abominably negative
## 8 abominate negative
## 9 abomination negative
## 10 abort negative
## # ... with 6,778 more rows
get_sentiments("nrc") #categories of positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust.
## # A tibble: 13,901 x 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
## 7 abandoned negative
## 8 abandoned sadness
## 9 abandonment anger
## 10 abandonment fear
## # ... with 13,891 more rows
We will plot the results, using the ggplot2 package for this.
library(ggplot2)
title_words_All_Source %>%
inner_join(get_sentiments("nrc")) %>%
group_by(source, sentiment) %>%
summarize(n = n()) %>%
ggplot(aes(x = source, y = n, fill = source)) +
geom_bar(stat = "identity", alpha = 0.8) +
facet_wrap(~ sentiment, ncol = 5)
As a next step, we can have a look at the comments for the different sources and compare word usage among their audiences. Downloading all comments takes time. You can do it on your own using the following function:
commentsVS = lapply(as.character(videosVS$video_id), function(x){
get_comment_threads(c(video_id = x), max_results = 50)
})
Please note that if the number of comments is too high, you may get errors. To avoid this, limit max_results.
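One way to make the download more robust (a hedged sketch, not part of the original tutorial) is to wrap each request in tryCatch, so a single failing video does not abort the whole loop:

```r
# Hedged sketch: return NULL for videos whose comments cannot be fetched
commentsVS_safe <- lapply(as.character(videosVS$video_id), function(x) {
  tryCatch(
    get_comment_threads(c(video_id = x), max_results = 50),
    error = function(e) NULL  # skip on error instead of stopping
  )
})
```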
Or you can use the file I created for you. I used the VS (you should already have it!), US Vogue and British Vogue channels for the dataset comments_All.csv
comments_All <- as.data.table(read.csv("YouTube tutorial-data/comments_All.csv", stringsAsFactors=FALSE))
head(comments_All)
## comment
## 1: hey you guys!\xd0___ is this on january 7 or on july 1st (sorry for my english i am from the netherlands)
## 2: What happens in July 1?
## 3: Check out my model work at Facebook.com/journeytofame
## 4: first okno
## 5: first okno
## 6: hey
## source
## 1: VS
## 2: VS
## 3: VS
## 4: VS
## 5: VS
## 6: VS
As you can see, the text looks messy, so we need to clean it.
replace_reg <- "https://t.co/[A-Za-z\\d]+|http://[A-Za-z\\d]+|&amp;|&lt;|&gt;|RT|https"
unnest_reg <- "([^A-Za-z_\\d#@']|'(?![A-Za-z_\\d#@]))"
comments_words <- comments_All %>%
as_tibble() %>%
mutate(comment = str_replace_all(comment, replace_reg, "")) %>% #strip URLs and HTML entities first
unnest_tokens(word, comment, token = "regex", pattern = unnest_reg) %>%
filter(!word %in% stop_words$word, str_detect(word, "[a-z]")) %>%
count(source, word, sort = TRUE)
Now let’s apply sentiment analysis:
library(ggplot2)
comments_words %>%
inner_join(get_sentiments("bing")) %>%
group_by(source, sentiment) %>%
summarize(n = n()) %>%
ggplot(aes(x = source, y = n, fill = source)) +
geom_bar(stat = "identity", alpha = 0.8) +
facet_wrap(~ sentiment, ncol = 5)
Let’s see what topics we can find in the description of the videos and compare them across three channels, US Vogue, British Vogue and VS. Does the description of the video affect the number of likes, views and dislikes?
For topic modeling we will turn to the topicmodels package. It does not work well with tidytext, so we will prepare the text with the tm package.
We will use the dataset we already have, videosVS; it has a list of all videos from the channel as well as their descriptions. I also downloaded the same data for US Vogue and British Vogue, and saved it to the description_All.csv file for you to use.
library(topicmodels)
library(tm)
description_All <- as.data.table(read.csv("YouTube tutorial-data/description_All.csv", stringsAsFactors=FALSE))
description_All$source<-as.factor(description_All$source)
head (description_All)
## description
## 1: French model and girl of the moment Jeanne Damas kicks off the new Vogue Shops series with a trip to one of France's iconic pharmacies, where she reveals ...
## 2: Gigi Hadid and Bella Hadid, cover stars of Vogue's March 2018 issue, take the sculpture challenge to find out just how well they know each other – but who will ...
## 3: Always keep your heels, head and standards high.” Models Alecia Morais, Azza Slimene, Demy de Vries and Loane Normand demonstrate a number of Coco ...
## 4: "I'm very grateful to be here, and that overpowers any of the negative elements that come with fame." Vogue wakes up with Ariana Grande, cover star of the July ...
## 5: What does Victoria Beckham carry inside her bag? In this episode of British Vogue's 'In The Bag,' our October cover star shows us the items she simply can't live ...
## 6: British Vogue spends the day with Seven "Posh" Boys, keep watching to see what was in store for the day! THERE was a merry twinkle in the eye of those in ...
## video_id source
## 1: X69Wd3b9Y04 VogueUK
## 2: r5rYxGK_VtA VogueUK
## 3: kTg8gKfq8TI VogueUK
## 4: n2wIqXBz4os VogueUK
## 5: XKMZR5uXAvk VogueUK
## 6: FOMAARZ8pmo VogueUK
Topic modeling is a sophisticated technique used to classify text and identify themes.
The most promising and widely used algorithm at this time is Latent Dirichlet Allocation (LDA). The beauty of the algorithm is that it allows overlap between topics (= themes). It assumes that each document is a collection of topics, and these topics can be shared among several documents. How do we identify which topics are covered in a document?
We look at the document and identify specific words in it that relate to a particular topic. How do we do this? We start by randomly assigning words to topics.
Do we know in advance how many topics there are? No. We just pick a number and experiment with different numbers of topics/documents/words again and again to see what works best for our text. The whole process is based on assessing probabilities: the probability of a topic in a document and the probability of a word in a topic. This goes on iteratively until it reaches a stable state. The document is assigned topic(s) based on the proportion of words assigned to each topic in that document, and words are assigned to a topic based on the proportion of words already assigned to the topic.
The math side of the algorithm is quite heavy. Indeed, you are welcome to dig into the original work of Blei et al. 2003, “Latent Dirichlet Allocation”, but I would suggest that we try to do some practical things to get you going!
So. The tm package.
It has a different philosophy than tidytext.
The main concept in tm’s way of managing documents is a corpus:

- Corpus: a collection of documents

So, let’s load our data there and point to the text:
corpus = Corpus(VectorSource(description_All$description))
Now that we have our corpus to analyse we need to clean it.
corpus <- tm_map(corpus, content_transformer(tolower)) #convert to lowercase
corpus <- tm_map(corpus, removePunctuation) #remove punctuation
corpus <- tm_map(corpus, removeWords, stopwords("english")) #remove stopwords
While the tidytext package has a very useful unnest_tokens function that not only tokenizes the text but also does some basic cleanup, including converting to lowercase, the tm package requires several explicit steps.
Next is stemming. You can think of it as a grandparent of tokenization… well, not exactly. Linguists see a really big difference between the two, and add lemmatization to the discussion as well. Stemming usually “chops off” the end of a word to get a base form. Lemmatization is more sophisticated and relies on vocabulary and morphological analysis of the word. Tokenization is different: it segments the text, where a segment can be a word, a sentence, a line, etc. Let’s do it!
corpus <- tm_map(corpus, stemDocument)
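To see what stemming actually does to individual words, here is a small illustrative sketch using the SnowballC package (which stemDocument relies on under the hood):

```r
# Hedged sketch: stem a few words directly with SnowballC
library(SnowballC)
wordStem(c("models", "modeling", "running", "runs"))
```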
Next, generating a document-term-matrix and removing rare words:
A document-term matrix (dtm) is a matrix that describes the frequencies of words in documents. Rows are documents and columns are words (= terms).
dtm = DocumentTermMatrix(corpus)
The first ten terms from our matrix: dama, franc, french, girl, icon, jeann, kick, model, moment, new
DTM can get big… very big. But not all terms in the dtm are that important. Some just add noise and take processing time.
# Remove sparse terms
dtm = removeSparseTerms(dtm, 0.997) #0.997 is sparsity: the function removes only terms that are more sparse than 0.997
Let’s review most frequent terms and draw a wordcloud:
findFreqTerms(dtm, 1000)
## character(0)
and finally create a data.frame for further analysis
labeledTerms = as.data.frame(as.matrix(dtm))
labeledTerms = labeledTerms[rowSums(abs(labeledTerms)) != 0,]
Now let’s have a look at what we can find with LDA. We need to specify the number of topics, which is hard to do, but we need to start with something. We will look at approaches to get a more accurate number of topics later. So? We need a number….
news_lda <- LDA(labeledTerms, k = 5, control = list(seed = 13))# set a seed so that the output of the model is predictable
We are back to the tidytext package to make the output more readable
news_topics <- tidy(news_lda, matrix = "beta")
news_top_terms <- news_topics %>%
group_by(topic) %>%
top_n(10, beta) %>%
ungroup() %>%
arrange(topic, -beta)
news_top_terms %>%
mutate(term = reorder(term, beta)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
coord_flip() + theme_bw()
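Besides the per-topic word probabilities (beta), tidytext can also extract per-document topic proportions (gamma); a quick sketch:

```r
# Hedged sketch: gamma gives the estimated proportion of each document
# generated by each topic
news_doc_topics <- tidy(news_lda, matrix = "gamma")
head(news_doc_topics)
```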
Now we need to look at the terms in each topic to see what topic they might “make”… You rarely get the number of topics right on the first run!
There are ways to mathematically estimate the number of topics. They are quite advanced and take time and resources to run. I ran one for us using the ldatuning package, and here are the results, ldaResults. The function is indeed resource-intensive: it took half an hour to process even this “small” dataset.
The function to use there is FindTopicsNumber:

result <- FindTopicsNumber(
labeledTerms,
topics = seq(from = 2, to = 80, by = 1),
metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
method = "Gibbs",
control = list(seed = 77),
mc.cores = 2L,
verbose = TRUE
)

It is easier to work with the results on a plot:
FindTopicsNumber_plot(result)